The PC-SIG Library 10

home *** CD-ROM | disk | FTP | other *** search

/ The PC-SIG Library 10 / The PC-Sig Library - Shareware for the IBM PC and Compatibles (PC-SIG)(Tenth Edition Disks 1-2804)(1991).iso / PC_SIGCD / 07 / 3 / DISK0731.ZIP / INDEXREF < prev next >

Wrap

Text File | 1987-02-09 | 6KB | 123 lines

REFERENCE INDEX(2) INDEX generates what is technically called an "inverted index". Whereas, the disk itself is organized as "files containing words", the index is organized as "words containing files". While the notion is simple, the implementation is faced with many issues. The largest issue is one of space. A simple list of words with each file name containing the word and a sub list of file names would be many times the size of the data to be indexed. The first economy of space is to build a map of file names to file numbers. Thus a two byte file number can stand for each instance of a file name in the index. File names are technically restricted to 64 characters. In practice, they are much shorter. Still the use of hierarchical directories requires a list of directory names to describe the path to a file. INDEX builds a "map" which links file names to directory names in the path. To attack the space problem further, the list of words itself must be attacked. First, INDEX, deals only with letters. No digits or punctuation are indexed. Second, case is ignored. All lower case letters are converted to upper case. Third, words less than 3 characters in length are discarded and words greater than 7 characters in length are significant only in the first 7 characters. Fourth, each word is discarded if it appears in the list of "common" words. Each of these methods gives up something in the retrieval phase to reduce the size of the index. Still, the list of unique words indexed is large. Acronyms and proper names added to "normal" words add up to perhaps hundreds of thousands of possible words. The memory size of most personal computers limits the number of words in memory to tens of thousands words at best. The second issue is one of speed. Unless indexing were performed at an operating system level, an index can only represent the state of the disk at one particular time. Thus it becomes necessary to re-index data periodically. While microcomputer systems are idle substantial periods of time, it is desirable to index quickly. Retrieval requires that the list of files associated with a key word be accessible quickly as well. INDEX thus makes the assumption that there are only 4093 unique words. To reduce the set of unique words to 4093, a mathematical operation (called hashing) is applied to each word generating a number which is then divided by 4093, the remainder standing for the word. Further, even though a file may contain many instances of keywords generating a particular number, only one such instance needs to be recorded. This greatly reduces the space and time required to generate an inverted index. The problem created by this assumption is that two different keywords may generate the same number. Fortunately, statistics are on our side and when two or more keywords are used for a retrieval, the probability of two files containing keywords generating the same numbers is small. Consequently, INDEX generates a file containing one entry for each unique keyword number which in turn points to a file containing one file number for each unique keyword number, file number pair. The following is a general description of the program flow. INDEX performs initialization, scans the current directory, and then wraps up. Initialization consists of writing the copyright notice and version, processing the parameters, initializing global variables, reading the EXTENSIO.INX file, reading the WORD.INX file and finally initializing output files (MAP.INX, ENTRY.INX, and KEY.INX). To scan the current directory, the MS-DOS function "find first file" is issued. If that is successful, the file name is parsed and the file is analyzed. While successive calls to "find next file" are successful, each file is analyzed. To analyze a file, the directory is checked to determine if the file is a directory or a file. If it is a directory and is not "." or ".." then a "change directory" is performed and the new directory is scanned recursively. The nesting level is incremented each time a sub-directory is encountered and decremented as each directory scan is completed. Each directory scanned generates an entry into the MAP.INX file. If a file is not a directory (i.e. is an ordinary file) and its extension is not found in the list of excludable extensions, it is added to MAP.INX pointing back to the parent directory. The file name is displayed on the console and then the file itself is scanned. To scan a file, a new ordered lists of codes is created and initialized (unless this is a second or subsequent pass in which case a list is reused). After the file is opened, it is scanned as blocks of characters. Any type of file (i.e. code, data, text, etc...) is treated the same way. Lower case letters are converted to upper case. File scanning has two states; either it is assembling a word or it is not. If it is not assembling words, characters are ignored until a letter is encountered in which case it switches state. If it is assembling words, each letter encountered effects a mathematical operation which generates a number. Finding a non letter terminates assembly of the word. The word is searched in the common word list. If found, the word is discarded and the scan continues. If not found the number is folded onto the range of distinct words and the code for the word is entered into a sorted list. Scanning continues until the end of file is reached. Upon reaching the end of a file, a lack of memory will trigger writing or updating the index file. Wrapup consists of writing an index, closing the files, and producing a brief report. Writing the index requires inverting the collected data which was collected and file order but needs to be retrieved in keyword code order. For each possible keyword code the list by file of keyword codes is scanned. Each file containing the current keyword code is written to KEY.INX. At the end of each keyword code the current position in KEY.INX is written to ENTRY.INX.